EpiChord: Parallelizing the Chord Lookup Algorithm with 
Reactive Routing State Management 
Technical Report MIT-LCS-TR-963 


Ben Leong 
benleong@mit.edu


Abstract— EpiChord is a DHT lookup algorithm that 
demonstrates that we can remove the O(log n)-state-per- 
node restriction on existing DHT topologies to achieve sig- 
nificantly better lookup performance and resilience using 
a novel reactive routing state maintenance strategy that 
amortizes network maintenance costs into existing lookups 
and by issuing parallel queries. Our technique allows us to 
design a new class of unlimited-state-per-node DHTs that is 
able to adapt naturally to a wide range of lookup workloads. 
EpiChord is able to achieve O(1)-hop lookup performance 
under lookup-intensive workloads, and at least O(log n)- 
hop lookup performance under churn-intensive workloads 
even in the worst case (though it is expected to perform bet- 
ter on average). 

Our reactive routing state maintenance strategy allows 
us to maintain large amounts of routing state with only a 
modest amount of bandwidth, while parallel queries serve 
to reduce lookup latency and allow us to avoid costly lookup 
timeouts. In general, EpiChord exploits the information 
gleaned from observing lookup traffic to improve lookup 
performance, and only sends network probes when nec- 
essary. Nodes populate their caches mainly from observ- 
ing network traffic, and cache entries are flushed from the 
cache after a fixed lifetime. 

Our simulations show that our approach can reduce
both lookup latencies and path lengths by a factor of 3 by
issuing only 3 queries asynchronously in parallel per lookup.
Furthermore, we show that we are able to achieve this result 
with minimal additional communication overhead and the 
number of messages generated per lookup is no more than 
that for the corresponding sequential Chord lookup algo- 
rithm over a range of lookup workloads. We also present a 
novel token-passing stabilization scheme that automatically 
detects and repairs global routing inconsistencies. 


I. INTRODUCTION 


In recent years, more than a dozen DHT lookup al- 
gorithms and routing topologies have been proposed [1], 
[2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13]. 
DHTs are important to distributed systems research be- 
cause they offer a scalable and efficient routing and object 
location platform for self-organizing peer-to-peer overlay 
networks. DHTs are expected to become a fundamental
building block of future large-scale distributed systems.
While most of the initial DHT research was directed to- 
wards minimizing the amount of routing state per node, 
more recent research has demonstrated that it is reason- 
able to attempt to store a global lookup table at every node 
to achieve one-hop lookup, when network churn is rela- 
tively low or if enough bandwidth is available, since local 
storage is relatively cheap [13]. 

The DHT designs and the various DHT-related tech- 
niques that have been proposed, e.g., proximity neighbor 
selection [14], synthetic coordinates [15], [16], erasure 
coding [17] and integrated P2P transport protocol [18], 
essentially allow us to trade off different amounts of stor- 
age and background maintenance bandwidth for better or 
worse lookup performance in a variety of ways. In this pa- 
per, we describe EpiChord, a DHT that demonstrates that 
we can remove the state storage restriction on O(log n)-state
DHTs¹ to achieve better lookup performance using
a novel reactive routing state maintenance strategy and by 
issuing multiple queries asynchronously in parallel. Our 
technique allows us to design a new class of unlimited- 
state-per-node DHTs that is able to adapt naturally to 
a wide range of lookup workloads. EpiChord is able 
to achieve O(1)-hop lookup performance under lookup- 
intensive workloads, and at least O(log n)-hop lookup 
performance under churn-intensive workloads even in the 
worst case, though it is expected to perform better on av- 
erage. 

While existing DHTs tend to decouple the lookup pro- 
cess from routing state maintenance and adopt a proactive 
routing state management strategy where nodes probe all 
(or at least most of) their routing entries periodically to en- 
sure that they are alive, EpiChord employs a reactive rout- 
ing state management strategy where routing state main- 
tenance costs are amortized into the lookup costs. Nodes 
rely mainly on observing lookup traffic and on piggyback- 


¹It is known that limiting the amount of state stored per node to
O(log n) limits the average lookup path length to no better than
O(log n / log log n) hops per lookup. Koorde [10] achieves this
O(log n / log log n)-hop lower bound.


ing additional network information on query replies to 
keep their routing state up-to-date under reasonable traf- 
fic conditions. EpiChord only sends probes as a backup 
mechanism if lookup traffic levels are too low to support 
the desired level of performance. 

Our reactive routing state maintenance strategy does 
not keep routing state quite as up-to-date as a proactive 
strategy, and therefore we use parallel lookups to amelio- 
rate the costs of keeping outdated routing state. In par- 
ticular, there is a synergistic relationship between large 
(> O(log n)) state and parallel lookups in our approach: 
while parallel queries allow us to avoid lookup timeouts 
due to stale routing entries, we can afford to issue parallel 
queries without generating excessive amounts of lookup 
traffic only because our large routing state reduces the 
number of hops per lookup and thereby the number of 
lookup messages. 

Although one might expect a parallel lookup algorithm 
to generate significantly more lookup traffic and thereby 
consume significantly more network bandwidth, we show 
that we are able in practice to achieve significantly better 
lookup performance on average (both in terms of lookup 
path length and latency) than that for the correspond- 
ing sequential Chord lookup algorithm with comparable 
amounts of lookup traffic. 

Our goal in this work is not to design the perfect DHT. 
Rather, our main objective is to explore and quantify the 
performance-cost trade-offs in moving from an O(log n)- 
state-per-node DHT topology to an unlimited-state-per- 
node architecture, by adopting a reactive routing state 
management strategy and using parallel queries. Consequently,
we compare EpiChord to the optimal² sequential
Chord lookup algorithm. Our parallel lookup algorithm is 
simple and effective, and our reactive approach to routing 
state maintenance allows our DHT to adapt naturally to a 
range of lookup workloads. 


II. OVERVIEW


Like Chord [2], EpiChord is organized as a one-dimensional
circular address space where each node is
assigned a unique node identifier (id). As shown in Figure 1,
the node responsible for a key is the node whose id
most closely follows the key, which we also call the
successor³. We use the cryptographic hash function SHA-1


²By optimal, we mean that we ignore Chord maintenance costs and
assume that the finger tables of the Chord nodes have perfectly accu- 
rate finger entries at all times regardless of node failures. The compet- 
ing sequential lookup algorithm is thus a reasonably strong adversary 
and not just a straw man. 

³The choice of which node is responsible for a key is somewhat
arbitrary. We could have decided to map a key to the node whose id 


[19] to determine the node id of a new node. SHA-1 ensures
that, with high probability, the node ids do not collide
(when the address space is sufficiently large, i.e. 128
bits) and are uniformly distributed over the entire circular
id address space.

Fig. 1. Circular identifier address space with twenty nodes and five


A. Basic Lookup Algorithm 


To look up a given id, node x initiates p queries in parallel
to the node immediately succeeding id and to the
p − 1 nodes preceding id, within the set of nodes known
to it (see Figure 2). Probing the succeeding node gives us
a chance of locating the destination node in one hop.

Fig. 2. Initial cache entries returned from cache for node x for a
lookup of id.
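
The selection of the p initial query targets can be sketched as follows. This is a minimal illustration, not the implementation described in Appendix A: the function name `initial_targets`, the use of plain integers for ids, and the rule that a known node whose id equals the target counts as its successor are all assumptions made for the example.

```python
from bisect import bisect_left


def initial_targets(known_ids, target, p):
    """Return up to p node ids to query first for `target`.

    The first id is the known node immediately succeeding `target`; the
    rest are the p - 1 known nodes preceding it, nearest first.
    (Hypothetical helper; ids are plain integers on a circular space.)
    """
    ring = sorted(set(known_ids))
    if not ring:
        return []
    p = min(p, len(ring))
    # First known id >= target; wrap to index 0 if target is past the largest id.
    succ = bisect_left(ring, target) % len(ring)
    # Walk counter-clockwise from the successor to collect the predecessors.
    return [ring[(succ - k) % len(ring)] for k in range(p)]


# Visible part of N32's initial cache in the Figure 3 example, key 2, p = 3.
print(initial_targets([55, 60, 0, 9, 17], target=2, p=3))  # [9, 0, 60]
```

With the cache entries N55, N60, N0, N9 and N17, the sketch selects N9 as the successor of key 2 and N0 and N60 as the two nearest predecessors, which matches the three initial queries in the Figure 3 example.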


We adopt two simple policies to learn new routing en- 
tries. (i) When a node first joins the network, it obtains a 
full cache transfer from one of its two immediate neigh- 
bors. (ii) Nodes gather information by observing lookup 
traffic: a node updates its cache based on information re- 
turned by queries and adds an entry to the cache each time 
it is queried by a node not already in the cache. 

When contacted, a probed node will respond to x as 
follows: 

• If it owns id, it will simply say so and respond with
the value associated with id (if one exists) and
information about its current immediate predecessor.


most closely precedes the key, i.e. the predecessor, or the node that has 
the id closest to the key [4], [6], and our algorithm can still be applied
with minor modifications. 


• If it is a predecessor of id relative to x, it will provide
information about its immediate successor and the l
“best” next hops to the destination id from its cache⁴.

• If it is a successor of id relative to x, it will provide
information about its immediate predecessor and the
l “best” next hops from its cache.


Here, l, like p, is a system parameter. We call an EpiChord
network where there are at most p concurrent queries per 
lookup a p-way EpiChord. 

When these replies are received, further queries will be 
dispatched asynchronously in parallel if x learns about 
nodes that are closer to the target id than the best suc- 
cessor and predecessor nodes that have already responded. 
An example of a lookup for the network shown in Figure 1 
is given in Figure 3. In this example, p = 3, l = 3 and
node N32 makes a lookup for the key K2. Note that when
the lookup terminates, N32 would have learned about all
the consecutive nodes in the range from N60 to N9. The 
simplified pseudocode for the lookup algorithm (which is 
implemented with callbacks and continuations) is given in 
Appendix A. 

There are several reasons why queried nodes respond 
with information about their successors or predecessors. 
Firstly, this allows us to check for termination⁵. Secondly,
since successors and predecessors are probed relatively
more frequently than other cache entries, they are likely
to be alive and hence with high probability, the querying
node will make at least one step of progress towards the
target id with each query. Lastly, even if nodes have an
outdated view of the segment of the id space that they are
responsible for, the querying node will be able to detect
such a situation and resolve a lookup correctly. For example,
an inconsistency can arise if the predecessor of a
given node y is responsible for a queried id and it fails
without informing y. Node y would not know that it is
now responsible for id.

Our lookup algorithm is intrinsically iterative. The 
main reason for this is that an iterative approach allows us 
to avoid sending redundant queries. If we employ paral- 
lel queries in a recursive lookup, nodes at the subsequent 
hops would not know when other nodes respond to the 
original node that issued the lookup, and hence which new 
nodes not to query. In general, such an approach is likely 


⁴Correspondingly, the l “best” next hops are the node immediately
succeeding id and the l − 1 nodes preceding id.

⁵In general, we can terminate a get() lookup operation and return
when the target id falls between a responding node and its successor
or predecessor or whenever a node returns the requested object. However,
if the node failure rate is high, we may choose to terminate a
put() lookup operation only after both the best predecessor and best
successor respond and we check that they are consistent (i.e., that they
think that they are adjacent to each other in the address space).


to require 2p × h messages (including both queries and
responses) per lookup, where p is the number of parallel
queries per hop and h is the number of hops. With an iterative
approach, we usually require only about 2(p + h)
messages per lookup.
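
To make the difference concrete, the two message-count estimates can be evaluated for a few path lengths. The function names below are illustrative only, and the formulas are the approximate counts quoted in the text rather than exact protocol costs.

```python
def recursive_messages(p, h):
    """Approximate messages per lookup with parallel recursive queries."""
    return 2 * p * h


def iterative_messages(p, h):
    """Approximate messages per lookup with the iterative approach."""
    return 2 * (p + h)


# p = 3 parallel queries, for lookups of 2, 4 and 8 hops.
for h in (2, 4, 8):
    print(h, recursive_messages(3, h), iterative_messages(3, h))
```

For p = 3, the gap widens with path length: 12 versus 10 messages at 2 hops, 24 versus 14 at 4 hops, and 48 versus 22 at 8 hops.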


B. Reactive Cache Management 


Each cache entry has an associated time. When a node 
receives a query or reply, it adds an entry for the sender if 
it is not already in the cache and sets (or resets) the time 
of the entry associated with the sender to that of its local 
clock. Query responses contain a lifetime for each entry, 
equal to the sender’s clock at the time of the send minus 
the node entry’s time in the sender’s cache, and this in- 
formation is used to set or reset the time in the receiver’s 
cache for that node. Node entries are flushed if their asso- 
ciated nodes do not respond to some number of queries or 
when their lifetime exceeds some limit, 7. 
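
The timestamp bookkeeping described above might be sketched as follows. This is a simplified illustration under several assumptions that are not in the text: the class and method names are hypothetical, network delay is ignored, a reported entry only replaces a local one if it is fresher, and the miss threshold `max_misses` stands in for "some number of queries".

```python
class ReactiveCache:
    """Sketch of a cache whose entries carry a local 'last known fresh' time."""

    def __init__(self, max_lifetime, max_misses=3):
        self.max_lifetime = max_lifetime  # lifetime limit for an entry
        self.max_misses = max_misses      # assumed threshold, not from the text
        self.last_heard = {}              # node id -> local clock value
        self.misses = {}                  # node id -> unanswered queries

    def heard_from(self, node_id, now):
        """A query or reply arrived directly from node_id."""
        self.last_heard[node_id] = now
        self.misses[node_id] = 0

    def ages(self, now):
        """Lifetimes to piggyback on a reply: sender clock minus entry time."""
        return {nid: now - t for nid, t in self.last_heard.items()}

    def merge_reply(self, reported_ages, now):
        """Convert reported lifetimes back into local clock values."""
        for nid, age in reported_ages.items():
            seen_at = now - age
            # Assumption: keep whichever timestamp is fresher.
            if seen_at > self.last_heard.get(nid, float("-inf")):
                self.last_heard[nid] = seen_at
                self.misses.setdefault(nid, 0)

    def record_timeout(self, node_id):
        if node_id in self.misses:
            self.misses[node_id] += 1

    def flush(self, now):
        """Drop entries that are too old or have missed too many queries."""
        expired = [nid for nid, t in self.last_heard.items()
                   if now - t > self.max_lifetime
                   or self.misses[nid] >= self.max_misses]
        for nid in expired:
            del self.last_heard[nid]
            del self.misses[nid]
        return expired
```

Because only lifetimes (clock differences) are exchanged, the sender's and receiver's clocks do not need to be synchronized in this sketch.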


Fig. 4. Division of address space into exponentially smaller slices
with respect to node x.


Like Chord, the correctness of the lookup algorithm is 
guaranteed because a query can always reach the destina- 
tion id by moving sequentially down the successor lists. 
In general, O(logn)-hop DHT routing schemes have a 
predefined set of O(log n) fingers and provide guarantees 
on lookup performance by ensuring that a node knows 
about some nodes in the vicinity of each finger. EpiChord 
divides the address space into two symmetric⁶ sets of
exponentially smaller slices as shown in Figure 4. For
performance guarantees, a node enforces the following
invariant:


Cache Invariant: Every slice contains at least
$\frac{j}{1-\gamma}$ cache entries at all times.


where $\gamma$ is a local estimate of the probability that a cache
entry is out-of-date (i.e., that the associated node had


⁶In contrast to the asymmetric Chord finger table, the division of the
address space into slices is symmetric by design. The key idea is that
when node x responds to node y, they will each know that the other is
alive, and if the node entry for y helps x to satisfy its cache invariant
for a particular slice, we want the node entry for x to also be useful in
satisfying the invariant for a corresponding slice in y’s cache.


[Table for Fig. 3: initial cache contents of each node, followed by the lookup trace at N32 with columns Time, Action by N32, Pending Queries, Best Predecessor, Best Successor and Comment. Actions, in order: send queries to N60, N0 and N9; reply from N60 with {N0, N1, N3} (N55 ignored because N60 responded); reply from N0 with {N60, N1, N3}; send query to N8; reply from N8 with {N0, N1, N3}; reply from N3, found key K2 (lookup returns); reply from N1 with {N0, N1, N3} (lookup terminates with best predecessor N1 and best successor N3).]

Fig. 3. Example of a lookup for the network shown in Figure 1. In this example, p = 3, l = 3 and node N32 makes a lookup for the key K2.


failed). A node checks its cache slices periodically and 
ensures that there are sufficient unexpired cache entries 
in each slice. Should a slice be found not to have suffi- 
cient unexpired cache entries, a node makes a lookup to 
the midpoint of that slice. Since j is small (e.g. 2), one
lookup is usually all it takes to satisfy the cache invariant. 
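
The periodic check can be sketched as below, assuming the cache invariant requires at least j/(1 − γ) unexpired entries per slice. The function names are hypothetical, the per-slice entry counts are taken as given, and how slice boundaries are computed is deliberately left out.

```python
import math


def required_entries(j, gamma):
    """Smallest whole number of entries satisfying j / (1 - gamma)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must be in [0, 1)")
    return math.ceil(j / (1.0 - gamma))


def slices_to_refresh(slice_counts, j, gamma):
    """Indices of slices with too few unexpired entries.

    For each returned slice, the node would issue one lookup to the
    midpoint of that slice.
    """
    need = required_entries(j, gamma)
    return [i for i, count in enumerate(slice_counts) if count < need]


# j = 2 and an estimated 25% stale entries: each slice needs 3 entries.
print(required_entries(2, 0.25))                     # 3
print(slices_to_refresh([5, 3, 2, 0], j=2, gamma=0.25))  # [2, 3]
```

Dividing by 1 − γ inflates the target so that, in expectation, about j of the entries in each slice are still live.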

The key idea is that to provide an O(log n)-hop guar- 
antee on the lookup path length, the density of entries per 
slice must increase exponentially as we get nearer to the 
node’s id. EpiChord estimates the number of slices from 
its k successors and k predecessors: it requires that the 
successor and predecessor lists fall into the two adjacent 
slices closest to the reference node. This implies that we 
need to choose j and k such that k > 2j.

To estimate $\gamma$, the probability that a given cache entry
is stale, each node tracks two variables:

• $n_p$, the number of nodes probed

• $n_t$, the number of probed nodes that timed out

We estimate $\gamma$ with:

$$\gamma = \frac{n_t}{n_p} \qquad (1)$$

In addition, we multiply $n_p$ and $n_t$ by a weighting factor periodically
(i.e., when the cache is flushed) to obtain exponentially
weighted moving averages for both estimates. We weight
the raw values instead of periodically computed ratios because
huge errors can be introduced in the estimates when
the frequency of computation is high and insufficient samples
are accumulated between computations. In our implementation,
we set this factor to 0.5 and we observe experimentally
that we can obtain relatively good estimates (to
within 25% of the true value) in the steady state with our
experimental parameters.
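
A minimal sketch of this estimator follows. The class and method names are hypothetical; the only values taken from the text are the two counters, the ratio of timed-out probes to probes, and the weighting factor of 0.5 applied when the cache is flushed.

```python
class StalenessEstimator:
    """Tracks probes and timeouts; decays the raw counters, not the ratio."""

    def __init__(self, weight=0.5):
        self.weight = weight   # weighting factor applied at each cache flush
        self.probed = 0.0      # n_p: nodes probed
        self.timed_out = 0.0   # n_t: probed nodes that timed out

    def record_probe(self, timed_out):
        self.probed += 1
        if timed_out:
            self.timed_out += 1

    def on_cache_flush(self):
        # Scaling both counters leaves the current ratio unchanged but
        # halves the influence of old samples on future estimates.
        self.probed *= self.weight
        self.timed_out *= self.weight

    def gamma(self):
        return self.timed_out / self.probed if self.probed else 0.0


est = StalenessEstimator()
for i in range(10):
    est.record_probe(timed_out=(i < 2))   # 2 timeouts in 10 probes
est.on_cache_flush()
for i in range(5):
    est.record_probe(timed_out=(i < 3))   # 3 timeouts in 5 probes
print(est.gamma())  # 0.4
```

In the usage above, the older window (2 of 10) is halved to (1 of 5) before the newer window (3 of 5) is added, so the estimate becomes 4/10 rather than the unweighted 5/15.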


C. Stabilization 


When multiple nodes attempt to join the Chord ring at 
approximately the same location, temporary inconsisten- 
cies may arise in the address space. Also, as nodes fail and 
leave the network unannounced, segments of the address 
space may become orphaned (i.e., none of the nodes know 
that they are responsible for them). We run a weak stabi- 
lization protocol periodically to fix local inconsistencies 
in the address space and a strong stabilization protocol to 
detect and fix global inconsistencies. 

Definition 1: We say that the network is (i) 
weakly stable if, for all nodes u, we have 
predecessor(successor(u)) = u; (ii) strongly 
stable if, in addition, for each node u, there is no 
node v such that u < v < successor(u); and 
(iii) loopy if it is weakly but not strongly stable 
(see [20]). 
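
Definition 1 can be checked mechanically on a small example. The sketch below assumes integer ids and represents the pointers as two dictionaries, `succ` and `pred`; these names and the helper `between` are illustrative and do not come from the paper.

```python
def between(a, x, b):
    """True if x lies strictly inside the circular interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b  # the interval wraps past zero


def is_weakly_stable(succ, pred):
    return all(pred[succ[u]] == u for u in succ)


def is_strongly_stable(succ, pred):
    if not is_weakly_stable(succ, pred):
        return False
    nodes = list(succ)
    # No node v may sit between any node u and its successor.
    return not any(between(u, v, succ[u]) for u in nodes for v in nodes)


def is_loopy(succ, pred):
    return is_weakly_stable(succ, pred) and not is_strongly_stable(succ, pred)


# Successor pointers that wind around the id space twice: 0, 2, 4, 1, 3, 5.
succ = {0: 2, 2: 4, 4: 1, 1: 3, 3: 5, 5: 0}
pred = {v: u for u, v in succ.items()}
print(is_weakly_stable(succ, pred), is_loopy(succ, pred))  # True True
```

The example ring is locally consistent, since every node's successor points back to it, yet node 1 lies between node 0 and its successor 2, so the configuration is loopy.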

1) Weak Stabilization Protocol: All messages contain 
the IP address, port number and id of the sender. So un- 
like Chord, there is no longer a need for a node to explic- 
itly notify its successor that it is the new predecessor after 
it joins the network. When it contacts the successor to ini- 
tiate a cache transfer, the successor would realize that the 
new node has joined the network and update its predeces- 
sor pointer accordingly. 




In addition, nodes periodically probe their immediate 
neighbors to check if they are still alive. When probed, 
a node will either (i) send a short reply message with its
current predecessor and successor or (ii) send a complete 
list of its immediate neighborhood (k predecessors and k 
successors) if a change was detected within k hops of the 
probing node. 

Each node is responsible for finding and maintaining its 
own successor and predecessor. When a node hears from 
another node whose id is closer than its current predeces- 
sor and successor, the new node is automatically set as 
the predecessor or successor accordingly. If a node learns 
about a node that could possibly be its new predecessor or 
successor indirectly from another node (or by observing 
lookup traffic), the node will probe this new node and set 
it as the predecessor or successor only if it receives a pos- 
itive response on the probe. Periodically, each node will 
probe its perceived successor and predecessor (which may 
not be correct) to learn about the nodes’ neighborhoods. 
In this way, a node is eventually guaranteed to discover a 
better predecessor or successor in the vicinity of its id, if
one exists. 


Theorem 1: The weak stabilization protocol 
will eventually cause an EpiChord network to 
converge to a weakly stable state. 


To prove this theorem, we observe that each node has
only a finite number of possibilities (exactly n − 1) for
its predecessor and successor. For a node u such that
predecessor(successor(u)) ≠ u, u would eventually
probe its successor and both would update their predecessor
and successor pointers accordingly. Each predecessor/successor
update event monotonically improves the
consistency of the address space, i.e., a node only adopts a
new predecessor or successor if it is strictly better than its
previous one. Therefore, the address space pointers
will eventually converge to a weakly stable state, which is
the only state where updates will no longer happen.

2) Strong Stabilization Protocol: Although it is in
general highly improbable that a network will end up
loopy (except perhaps after a network partition), for com- 
pleteness, it is still desirable to have a scheme that will 
detect and fix global inconsistencies in the address space. 
Our strong stabilization algorithm is based on a very sim- 
ple idea: to detect loops, all we need to do is to traverse the 
entire ring and make sure that we come back to where we 
started. Figure 5 shows a graphical example of a loopy, but
locally consistent address space. In this example, node n 
forwards a packet containing its identifier along the ring. 
When the packet reaches node m, m realizes that n exists 
and initiates the weak stabilization protocol with n to re- 
pair the address space. A naive scheme to pass a single 


token along the ring will take a long time and is relatively 
inefficient, so instead, we implement a parallelized token- 
passing scheme.


Fig. 5. An example of a loopy address space configuration. The
arrows indicate the direction of the successor pointers. 


As loopy configurations are expected to be rare, strong 
stabilization needs to be performed only infrequently. The 
key idea in our strong stabilization protocol is to generate 
and pass q tokens (which are simply messages) along the 
ring using only the successor pointers. In our protocol, 
immediately after a node sees a stabilization token (or im- 
mediately after it joins the network), it will pick a random 
waiting period from the interval ($t_{min}$, $t_{max}$) after which
it will initiate strong stabilization. If a node sees a token 
before its timer runs out, it will reset its timer and choose 
again. In this way, we can control the number of concur- 
rent tokens that are passed in the ring at any given instant 
in time in a distributed fashion. 

To initiate the strong stabilization process: 


• a node x (with identifier $n_x$) picks q nodes with identifiers
$n_1, n_2, \ldots, n_q$, distributed approximately uniformly
in the address space, from its cache, where q
is the degree of parallelization and $n_x < n_1 < n_2 < \cdots < n_q$.

• x sends node $n_q$ a token with $n_x$ (itself) marked as
the destination.

• x then proceeds to send node $n_i$ a token with $n_{i+1}$
marked as the destination, for $i = q-1, \ldots, 1$ in
order. If a given node $n_i$ is found to have failed,
another node in its vicinity is chosen instead.

• finally, x generates a token with destination $n_1$ and
passes it to its successor.

This is illustrated in Figure 6. 
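
The token generation steps above can be sketched as a function that returns the (recipient, destination) pairs in the order x would send them. The function name, the integer ids, the `modulus` parameter for the size of the id space, and the fallback to a single token when no nodes are picked are assumptions for illustration; replacing failed recipients is omitted.

```python
def plan_tokens(x_id, successor_id, picked, modulus):
    """Return (recipient, destination) pairs for one round started by x."""
    if not picked:
        # Assumed fallback: one token that must travel the whole ring.
        return [(successor_id, x_id)]
    # Order the picked ids clockwise from x so that n_x < n_1 < ... < n_q.
    ordered = sorted(picked, key=lambda n: (n - x_id) % modulus)
    # n_q receives the token destined for x itself.
    plan = [(ordered[-1], x_id)]
    # n_i receives the token destined for n_{i+1}, for i = q-1, ..., 1.
    for i in range(len(ordered) - 2, -1, -1):
        plan.append((ordered[i], ordered[i + 1]))
    # Finally, x's own successor receives the token destined for n_1.
    plan.append((successor_id, ordered[0]))
    return plan


# Node 32 in a 64-id space, with successor 40, picks q = 3 nodes.
print(plan_tokens(32, 40, picked=[60, 9, 50], modulus=64))
# [(9, 32), (60, 9), (50, 60), (40, 50)]
```

Together the four tokens cover the segments 32 to 50, 50 to 60, 60 to 9 and 9 to 32, so every part of the ring is traversed by exactly one token.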

When a node receives a token, it passes the token to its 
successor. A token is destroyed when it reaches a node 
with an identifier greater than or equal to its intended destination
(modulo the circular address space). When a token is
destroyed, one of two possibilities can occur: 

1) the segment of the address space traversed by the 

token is not loopy, in which case, the token either 


Fig. 6. Example on the generation of q stabilization tokens. 


ends up at its intended destination or its successor 
(if the destination node failed in the meantime) and
nothing happens. All the nodes in the path of the 
token would however have learned about the desti- 
nation node. 

2) the segment of the address space traversed by the 
token is loopy and the token does not end up at the 
intended destination. Again, however, the nodes in 
the path of the token would have learned about the 
destination node and as a result, two of the nodes 
on the alternate segment in the vicinity of the desti- 
nation node would start to probe for the destination 
node because of weak stabilization and the loop will 
eventually be eliminated. 


If the network is large, n/q is large and it will still take
a long time (and many hops per token) to complete one
round of token-passing. To avoid this problem, nodes
generate secondary tokens. For example, suppose node y with
identifier $n_y$ receives a token destined for $n_z$. Instead of
just passing the token to its successor, the node can also
choose q nodes with identifiers $n_1, n_2, \ldots, n_q$ such that
$n_y < n_1 < n_2 < \cdots < n_q < n_z$ and generate the
corresponding q tokens. With this recursive process, each
token-passing round can be completed in O(log n) time.


Theorem 2: The combination of our recursive 
parallel token-passing algorithm with the weak 
stabilization protocol will cause an EpiChord 
network to converge to a strongly stable state 
after at most O(n²) rounds of token-passing.


There are two key intuitions behind the correctness of this 
theorem. First, if the network is loopy, the token-passing 
algorithm will cause at least one pair of nodes to detect 
an inconsistency. Next, whenever such an inconsistency 
is detected, the pair of nodes that detect the inconsistency 
will update each other and strictly improve the state of the 
network. Since each node in the network has one correct 
successor and the only stable state is when the network is 
no longer loopy, we conclude that the network must even- 
tually become strongly stable. The bound is obtained from 


observing that each node has only n possible choices for 
its successor. Since each round of token-passing updates 
at least one node, we know that it will take at most O(n²)
rounds to update the successor (and predecessor) pointers 
to the correct values. 

To see that the token-passing algorithm will allow at 
least one pair of nodes to detect an inconsistency, consider 
a network that is weakly stable, i.e., if we followed the 
successor pointers we would eventually end up where we 
started. Suppose we choose r nodes arbitrarily such that
$n_1 < n_2 < \cdots < n_r$. Take a node, say $n_i$, and follow the
successor pointers. Repeat this process for all nodes. If
the network is not loopy, it is clear that the node ids would
increase monotonically (modulo the address space) until
we reach node $n_{i+1}$. If the network is loopy, for at least
one node $n_i$, we would eventually reach a node $n_z$ such
that $n_i < n_{i+1} < n_z$ (modulo the address space). The
key is to recognize that the net effect of our secondary 
token generation mechanism is to choose these r nodes 
recursively. 

Intuitively, it is quite easy to see that if we choose a set 
of r nodes in the ring and have them forward messages 
to adjacent nodes in this set along the ring, we can detect 
inconsistencies. What is interesting about our algorithm is 
that we have demonstrated that we can choose this set of 
r nodes recursively in a distributed way and still preserve 
the correctness of this approach. 


III. ANALYSIS 
A. Worst-Case Lookup Performance 


If we assume a uniformly distributed workload, we can
show that the worst-case lookup performance is O(log n)
hops. In addition, the expected worst-case lookup path
length is at most $\frac{1}{2}\log_\alpha n$, where $\alpha = 3j + \ldots$ Here, n
is the size of the network, and j is the minimum number
of cache entries per slice (see Appendix B). When j = 1,
we get the same expected worst-case result as Chord does.
However, for j ≥ 2, we tend to do much better: for j = 2,
$\alpha = 7.2$ and the EpiChord expected lookup path lengths
are at most only $\frac{\frac{1}{2}\log_\alpha n}{\frac{1}{2}\log_2 n} = \log_\alpha 2 \approx \frac{1}{3}$ of that for Chord⁷.
Our analysis implicitly assumes that the queries in each
hop are synchronized. Because our lookup algorithm is
asynchronous, actual lookup path lengths will tend to be
slightly larger.
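
The ratio quoted for j = 2 can be checked numerically. The sketch below only uses the value α = 7.2 given in the text and the (1/2) log n form of both bounds; the function names and the sample network size are illustrative.

```python
import math


def expected_hops_bound(n, base):
    """Upper bound (1/2) * log_base(n) on the expected path length."""
    return 0.5 * math.log(n) / math.log(base)


def ratio_to_chord(alpha):
    """(1/2 log_alpha n) / (1/2 log_2 n) simplifies to log_alpha 2."""
    return math.log(2) / math.log(alpha)


n = 10_000  # sample network size, chosen only for illustration
print(round(expected_hops_bound(n, 2), 2))    # Chord bound
print(round(expected_hops_bound(n, 7.2), 2))  # EpiChord bound for j = 2
print(round(ratio_to_chord(7.2), 3))          # about 0.351, i.e. roughly 1/3
```

Because the ratio log_α 2 does not depend on n, the factor-of-three improvement in the bound holds at every network size.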


B. Reduction in Background Probes 


EpiChord exploits information gleaned from observing
lookup traffic to improve lookup performance, and only


⁷The expected lookup path length for Chord is $\frac{1}{2}\log_2 n$ [20].


sends network probes when necessary. To see the band- 
width savings with our approach, we consider a network 
with a steady state size of 20,000 nodes and nodes that 
have a median lifespan of 60 minutes⁸. This translates to
a node failure rate of approximately 0.03% (or 5 nodes) 
per second. Assuming that the application-level lookup 
traffic received by a node is approximately uniformly dis- 
tributed (this is a reasonable assumption since node ids 
are obtained using the SHA-1 hash [19] and are thus uni- 
formly distributed), the proportion of lookup traffic that 
will help to satisfy the cache invariants for various values
of lookup traffic and j is shown in Figure 7. With an
amount of lookup traffic approximately equal to the re- 
quired background maintenance traffic (i.e., x = 1 in Fig- 
ure 7), we can achieve a 35% reduction in the background 
maintenance traffic. At larger network sizes, the savings 
in background maintenance traffic is reduced. However, 
as shown in Figure 8, even at network sizes of 1,000,000 
nodes, we can still expect a reduction of more than 25% 
on average.

Fig. 7. Effect of j on the proportion of lookup traffic that helps to
satisfy cache invariant for 20,000-node network.


C. Cache Composition in the Steady State 


The proportion of live entries⁹ in the cache is an important
system parameter because it determines the probability
of a timeout occurring during a lookup. To obtain
an estimate of the number of live entries in a cache in the 
steady state, we consider a network of size n such that in 
a fixed time interval, a fraction r of the nodes in the net- 
work leave, a fraction f of the cache entries are flushed 


⁸These figures are representative of both the Napster and Gnutella
peer-to-peer file-sharing networks as reported in a measurement study 
by Saroiu et al. [21]. 

⁹An entry is live if its associated node is still online. The set of cache
entries for a node will in general consist of some live entries and some
unexpired, stale entries.

Fig. 8. Effect of network size (n) on the proportion of lookup traffic
that helps to satisfy cache invariant (for j = 2).


and each node makes Q lookups uniformly over the id
address space and sends out p queries in parallel for each
lookup. Where x is the number of live nodes that are known
to a node at time t, we obtain the following relation:

$$\frac{d}{dt}x(t) = \underbrace{pQ\left(1-\frac{x}{n}\right)}_{\text{incoming queries}} - \underbrace{fx}_{\text{flushed}} - \underbrace{(1-f)\,rx}_{\text{nodes departed but not flushed}} \qquad (2)$$


We have assumed that new knowledge comes only from 
the incoming queries as a node would have to know about 
a node in order to send an outgoing query to it. This is 
conservative and will tend to under-estimate the increase 
in x. We have also assumed that the probability that a
cache entry is flushed is independent of the probability of 
failure for the associated node. The steady state solution 
to x is: 


$$\lim_{t\to\infty} x(t) = \frac{pQ}{\frac{pQ}{n} + f + r - rf} \qquad (3)$$


In addition, where y is the number of stale cache entries 
at time t, we have the following relation:


$$\frac{d}{dt}y(t) = (1-f)\,rx - fy - \underbrace{pQ\,\frac{y}{x+y}}_{\text{stale entries discovered}} \qquad (4)$$

In a network with high churn, the proportion of stale entries
in the cache, $\gamma$, is a key system parameter:

$$\gamma = \lim_{t\to\infty} \frac{y}{x+y} = \frac{(1-f)\,rx - fy}{pQ} \qquad (5)$$


If $pQ \gg rn$ and $f = 0$, then $x \to n$ and $\gamma \approx \frac{rn}{pQ} \approx 0$. This
implies that if the level of lookup traffic is high enough, 
the performance of the system is somewhat independent 
of the cache maintenance protocol. This agrees with our 
intuition, since with a high level of lookup traffic, most 
nodes will know about a large number of other nodes and 
many stale cache entries would be discovered and elimi- 
nated during the lookup process. 

Next, we consider the case when $pQ \ll rn$. In the
steady state,

$$\lim_{t\to\infty} x(t) \approx \frac{pQ}{f + r - rf} \qquad (6)$$

By setting $\frac{dy}{dt} = 0$ in (4), we obtain:

$$fy^2 + \left[pQ + (f - r + rf)\,x\right]y - (1-f)\,rx^2 = 0 \qquad (7)$$

If cache entries are flushed at a rate that is at least as fast
as the node failure rate, i.e. $f \ge r$, then

$$\frac{y}{x+y} \le 1 - \frac{1}{\sqrt{2}} \approx 0.292 \qquad (11)$$


Thus, our model predicts that even when the churn rate is 
high (pQ ≪ rn), at most 30% of the cache entries will be
stale (and this result is independent of the level of lookup 
traffic pQ). This result was verified by our simulations. 
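
As a rough numerical check of the bound above, the following sketch solves the steady-state model for the live-entry count x, the stale-entry count y, and the stale fraction y/(x+y). It assumes the steady-state value of x obtained by setting dx/dt = 0 in (2), namely x = pQ / (pQ/n + f + r - rf), together with the quadratic (7); the function name and the sample parameter values are illustrative and are not taken from the simulator used in this report.

```python
import math

def steady_state_cache(p, Q, r, f, n):
    """Return (x, y, gamma) for the steady-state cache model (requires f > 0).

    x: live entries known to a node, from setting dx/dt = 0 in (2).
    y: stale entries, the positive root of the quadratic (7).
    gamma: stale fraction y / (x + y).
    """
    x = p * Q / (p * Q / n + f + r - r * f)
    # Quadratic (7): f*y^2 + [pQ + (f - r + r*f)*x]*y - (1 - f)*r*x^2 = 0
    a = f
    b = p * Q + (f - r + r * f) * x
    c = -(1 - f) * r * x * x
    y = (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a)
    return x, y, y / (x + y)

# Churn-intensive example (pQ << rn) with f = r, the slowest flush rate
# covered by the bound: gamma should stay below 1 - 1/sqrt(2).
x, y, gamma = steady_state_cache(p=3, Q=0.01, r=1 / 600, f=1 / 600, n=9000)
print(f"live={x:.2f} stale={y:.2f} stale fraction={gamma:.3f}")
```

Raising f relative to r in this sketch lowers the stale fraction, which is the trade-off the model describes: faster flushing means fewer stale entries but also fewer cached entries overall.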
To get a simpler closed form for the expected proportion of
stale entries, let f = cr, i.e., we flush entries at a rate that
is c times faster than the node failure rate. Next, assume
that f is small and 1/c < 1, then
IV. SIMULATION RESULTS


To understand the trade-offs when we move from an 
O(log n)-state-per-node DHT to an unlimited-state-per- 
node DHT with the same basic routing topology, we com- 
pare EpiChord to a corresponding optimal iterative Chord 
network of the same size using our simulation built on 
the ssfnet [22] simulation framework. We run the simu- 
lations on a 10,450-node network topology organized as 
25 autonomous systems, each with 13 routers and 405 
end-hosts. The simulated network topology is represented 
graphically in Figure 9. 


Fig. 9. Simulation Network Topology.


The average round-trip time (RTT) between nodes in the
topology is approximately 0.16 s. Hence, we set timeouts
at 0.5 s for all simulations. The cumulative distribution
of the RTTs between any pair of nodes in the simulation
topology is shown in Figure 10. Since all query packets
are UDP-based and packets may be lost, we retransmit
twice after a timeout and decide that a node has failed
if we do not hear from it after 3 tries.
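
The retransmission policy described above (a 0.5 s timeout, two retransmissions, and a failure verdict after three unanswered tries) can be sketched as follows. The `send_and_wait` callable is a hypothetical stand-in for a UDP request/response primitive; it is an assumption made for illustration and is not part of the simulator.

```python
from typing import Callable, Optional

TIMEOUT_S = 0.5   # per-try timeout used in the simulations
MAX_TRIES = 3     # initial transmission plus two retransmissions

def probe(send_and_wait: Callable[[float], Optional[bytes]]) -> Optional[bytes]:
    """Send a query up to MAX_TRIES times.

    send_and_wait(timeout) is a hypothetical callable that transmits one
    query and returns the reply, or None if nothing arrives in time.
    Returns the reply, or None if the node should be considered failed.
    """
    for _ in range(MAX_TRIES):
        reply = send_and_wait(TIMEOUT_S)
        if reply is not None:
            return reply
    return None
```

With these constants, a failed node is declared dead only after 1.5 s, which is why lookups that hit a timeout are so much slower than those that do not.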
Fig. 10. Cumulative distribution of RTTs in simulation topology.


Li et al. highlighted that the assumed workload will
affect the result of comparisons between DHTs
significantly [23]. They proposed two generic classes of
workloads: lookup-intensive and churn-intensive. Although
they did not propose exact definitions for these two classes
of workloads, we do have a very natural way of defining
them for EpiChord based on our steady-state cache model.
In particular, we consider a workload to be lookup-intensive
if pQ ≫ rn, and churn-intensive if pQ ≪ rn.

In our simulations, we first generate a sequence of 
node joins/departures and queries according to a pre- 
determined set of network parameters. Subsequently, we 
run the same set of traces on the EpiChord networks of 
varying degrees of parallelism and on a corresponding 
Chord network. This ensures that the results can be
compared fairly across the two algorithms without bias in the
choice of node ids and lookup ids.


A. Lookup-Intensive Workload (pQ ≫ rn)


In our lookup-intensive workload simulation, node 
lifespans are exponentially distributed with a mean of 
600 s. We experiment with a range of network sizes by 
varying the rate of node joins from 0.33 to 2 nodes per 
second. Each node in the network makes on average 2 
lookups per second. In steady state, the network sizes 
range from 200 to 1,200 nodes and the overall system 
query rate ranges from 400 to 2,400 lookups per second. 
The stabilization interval is 60s (i.e., nodes probe their 
successors and predecessors once a minute) and the life- 
time of a cache entry is 120 s. Since the expected back- 
ground maintenance traffic is negligible compared to the 
active lookup rate, Q ≈ 2 and rn ranges from 0.33 to 2.
Also, r ≈ 1/600, f ≈ 1/120 (f > r) and j = 2.
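
The steady-state figures quoted above follow from Little's law: the expected population equals the join rate multiplied by the mean lifespan. The short check below reproduces them; the function names are illustrative, and the lower join rate is taken to be 1/3 node per second (an assumption, since the text rounds it to 0.33).

```python
def steady_state_size(join_rate_per_s, mean_lifespan_s):
    """Expected number of live nodes in steady state (Little's law)."""
    return join_rate_per_s * mean_lifespan_s

def system_query_rate(n_nodes, lookups_per_node_per_s):
    """Aggregate lookup rate across the whole network."""
    return n_nodes * lookups_per_node_per_s

for join_rate in (1 / 3, 2.0):
    n = steady_state_size(join_rate, 600)
    # Columns: network size, system lookups per second, and rn with r = 1/600.
    print(round(n), round(system_query_rate(n, 2)), round(n / 600, 2))
```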

Since we are comparing EpiChord to Chord, we had
to pick an appropriate cache entry TTL to ensure that
the maximal background maintenance traffic generated by
EpiChord does not exceed that for a corresponding Chord
network. If we assume that nodes have exponentially
distributed (memoryless) lifetimes with mean T, where X
is the random variable representing the time of failure,
the probability that a node is still alive at time t,
P(X > t), is given by:

$$P(X > t) = e^{-t/T} \qquad (15)$$

Hence, if nodes have mean lifetimes of 600 s, cache
entries would have to be probed at least once every 60 s to
ensure that they have a 90% probability of being valid.
It is thus reasonable to assume a periodic probe interval of
at most 60 s for a Chord network. The minimal routing
set for an EpiChord network of comparable size with
j = 2 would have slightly less than 4 times as many
entries. However, since the cache slices for EpiChord are
symmetric, we need only half the number of probes
required by Chord, and so with a mean cache entry TTL of
120 s ≥ 2 × 60 s, the maximum background maintenance
traffic for the EpiChord networks (even in the absence of
lookups) in our simulations is guaranteed not to exceed
that for the corresponding Chord network.
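
The 90% validity figure used above can be checked directly from the exponential lifetime assumption. The sketch below evaluates the survival probability after a 60 s probe interval and, conversely, the longest interval that keeps validity at or above a target; the function names are illustrative.

```python
import math

def survival_probability(t, mean_lifetime):
    """P(X > t) for an exponentially distributed lifetime with the given mean."""
    return math.exp(-t / mean_lifetime)

def max_probe_interval(target_validity, mean_lifetime):
    """Longest probe interval that keeps P(X > t) at or above target_validity."""
    return -mean_lifetime * math.log(target_validity)

print(survival_probability(60, 600))   # about 0.905
print(max_probe_interval(0.9, 600))    # about 63.2 seconds
```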

1) Lookup Performance: The average latency and the 
average hop count per lookup for successful lookups in 
the steady state are shown in Figures 11 and 12 respec- 
tively. From Figure 11, we see that having more paral- 
lelism reduces the lookup latency. In Figure 12, the hop 
count for EpiChord is defined as the minimum number 
of nodes that have to be contacted in the final (successful) 
lookup sequence. We see that the average steady-state hop 
count varies from 1.1 to 1.4. This means that at least 60% 
of the lookups succeed within the initial wave of lookup 
queries. This result is actually not surprising since we
know from our analysis that the expected worst-case hop
count is $\frac{1}{2}\log_{\alpha} n = \frac{1}{2}\log_{7.2} 1{,}200 \approx 1.80$,
where $\alpha = 3j + \frac{6}{j+3} = 7.2$ for j = 2.
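
The value of 1.80 quoted above can be reproduced with a few lines of arithmetic. The sketch assumes the per-hop reduction factor 3j + 6/(j+3) derived in Appendix B and the half-logarithm form quoted in the text; the function names are illustrative.

```python
import math

def alpha(j):
    """Expected per-hop distance reduction factor, 3j + 6/(j+3)."""
    return 3 * j + 6 / (j + 3)

def expected_worst_case_hops(n, j):
    """(1/2) * log base alpha(j) of n, as quoted in the text."""
    return 0.5 * math.log(n) / math.log(alpha(j))

print(alpha(2))                                      # approximately 7.2
print(round(expected_worst_case_hops(1200, 2), 2))   # 1.8
```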

We consider a lookup to be successful if it locates the
correct node within 5 minutes. This means that it is
occasionally possible for a lookup to time out while waiting
for the response from some failed node, and then
subsequently continue the lookup process with another
node and succeed. The distribution of latencies is
thus strongly bimodal, with the majority of lookups
succeeding relatively quickly while a small fraction
succeeds only after a timeout. These timeouts explain why the
average latency as reported in Figure 11 for the 1-way
EpiChord network is about twice the average RTT instead of
being approximately equal to the RTT even though the
hop count is 1.4. The timeout probabilities are shown in
Figure 14. The latencies for successful lookups that
experience timeouts are generally more than 10 times those
for lookups that do not time out, though fortunately, the
former occur much less frequently than the latter.

As shown in Figure 13, the lookup failure rates are
relatively low (< 0.1%). This is not surprising since under the
lookup-intensive workload, the large number of lookups
keeps the routing state for most nodes mostly up-to-date.
Adding more parallelism (increasing p) reduces the
probability of lookup failure significantly.¹⁰ The lookup failure
probability falls by approximately an order of magnitude
when p is increased by one.

2) Message Count: It is clear that a parallel lookup
algorithm will generate more lookup messages when there
are more parallel queries per lookup. Figure 15 shows
that for our given parameter settings, the average number
of query and reply messages that are required for a
sequential Chord network is approximately equal to that for
a 3-way EpiChord network. As mentioned in Section II-A,
the main reason why the number of lookup messages
does not increase in proportion with p is that with iterative
lookups, the querying node can avoid sending duplicate
and redundant queries.

Fig. 11. Comparison of lookup latency between Chord and p-way
EpiChord under lookup-intensive workload.

Fig. 12. Comparison of lookup path length between Chord and p-way
EpiChord under lookup-intensive workload.


B. Churn-Intensive Workload (pQ ≪ rn)


In our churn-intensive workload simulation, node
lifespans are exponentially distributed with a mean of 600 s.
The stabilization interval is 60 s and the lifetime of a cache
entry is 120 s. We experiment with a range of network
sizes by varying the rate of node joins from 1 to 15 nodes
per second. Each node in the network makes on average
0.01 lookups per second. Because the lookup rate is
so low, most of the lookups captured in our results are
lookups arising from node joins and cache maintenance.
In steady state, the network sizes range from 600 to 9,000
nodes and the overall system query rate ranges from 40 to
700 lookups per second. Including the minimal expected
background maintenance traffic, Q ≈ 0.05 to 0.08 and rn
ranges from 1 to 15. As before, r ≈ 1/600, f ≈ 1/120 (f > r)
and j = 2.

Fig. 13. Lookup failure rates for p-way EpiChord networks under
lookup-intensive workload.

Fig. 14. Probability of timeouts for p-way EpiChord networks under
lookup-intensive workload.

1) Lookup Performance: The average latency and the
average hop count per lookup for all successful lookups
are shown in Figures 16 and 17 respectively. Again, we
see from Figure 16 that adding more parallelism reduces
the lookup latency significantly. As shown in Figure 18,
the lookup failure probabilities under the churn-intensive
workload are higher than those under the lookup-intensive
workload (which are < 0.1%). This is to be expected
since the churn rate is higher and the information
propagation rate is significantly lower. From Figure 18, we
see that the failure rates for the 4- and 5-way EpiChord
networks are higher than that for the 3-way EpiChord
network, which is somewhat counter-intuitive. We
discovered that the cause of this phenomenon is that with a larger
p, each lookup invoked for cache maintenance satisfies
the cache invariant for more nodes and so the 4- and
5-way EpiChord networks generate fewer cache-refreshing
lookups than a 3-way EpiChord network. This lower
rate of background maintenance traffic accounts for the
marginally higher failure rates for larger network sizes.
As shown in Figure 19, with p > 2, successful lookups
will almost never experience timeouts.

Fig. 15. Comparison of lookup message count between Chord and
p-way EpiChord under lookup-intensive workload.

Fig. 16. Comparison of lookup latency between Chord and p-way
EpiChord under churn-intensive workload.

2) Message Count: As shown in Figure 20, more
messages are required to complete a lookup under a
churn-intensive workload. However, the increase in message
count over the lookup-intensive workload is quite modest:
a 1-way EpiChord network requires approximately
the same number of messages per lookup as the
corresponding Chord network, while a 3-way EpiChord
network incurs approximately 50% more lookup traffic.

Fig. 17. Comparison of lookup path length between Chord and p-way
EpiChord under churn-intensive workload.

Fig. 18. Lookup failure rates for p-way EpiChord networks under
churn-intensive workload.


C. Cache Composition 


Figures 21 and 23 show the average number of live and 
stale entries in the caches of the nodes for the EpiChord 
networks under a lookup-intensive workload and a churn- 
intensive workload respectively, while Figures 22 and 24 
show the fraction of stale entries in the respective caches. 
These results seem to support the conclusion from our 
analysis in Section III-C that the fraction of stale entries 
depends only on the node failure rate and the frequency at 
which entries are flushed from the cache. 

Fig. 19. Probability of timeouts for p-way EpiChord networks under
churn-intensive workload.

Fig. 20. Comparison of lookup message count between Chord and
p-way EpiChord under churn-intensive workload.

According to our cache model for churn-intensive
workloads,


1 
= eed 

This means that the predicted fraction of stale cache
entries in the steady state is approximately 9% for the
expected node lifespans and cache flush rates in our
experiments. From Figures 23 and 24, we see that our
estimate of 9% is somewhat smaller than the actual value
(≈ 12.5%). This is likely due to the neglected terms in our
approximation and also to the fact that f is smaller than
1/120 in practice, i.e., although cache entries are flushed
every 120 s, the probability of a cache entry being flushed out
every second is smaller than 1/120.


D. Effect of Lookup Traffic 


To investigate the effect of lookup traffic on lookup
performance, we hold p and l constant at 3 and vary the
amount of lookup traffic per node Q between 0.01 and
2.0 per second for a range of networks with sizes from
600 to 1,200 nodes. As shown in Figures 25, 26 and 27,
increasing the amount of lookup traffic reduces the lookup
path length, lookup latency, and the number of messages
sent per lookup. There are however decreasing marginal
returns with increasing traffic and the EpiChord lookup
algorithm achieves close to optimal performance with a
reasonably small amount of lookup traffic (i.e., Q = 0.5).

Fig. 21. Cache composition for p-way EpiChord networks under
lookup-intensive workload.

Fig. 22. Fraction of stale entries for p-way EpiChord networks under
lookup-intensive workload.


E. Effect of Number of Entries Returned Per Query 


To investigate the effect of the number of “best entries”
returned per response, l, on lookup performance, we hold
p constant at 3 and the amount of lookup traffic Q constant
at 0.01 per node per second (to minimize the lookup-traffic
effect) and vary l between 2 and 4 for a set of networks with
sizes from 600 to 1,200 nodes. As demonstrated by our
results shown in Figures 28, 29 and 30, l has a negligible
effect on the lookup path length, lookup latency and the
number of messages sent per lookup. We thus conclude
that we can keep l small and set l = 3.

Fig. 23. Cache composition for p-way EpiChord networks under
churn-intensive workload.

Fig. 24. Fraction of stale cache entries for p-way EpiChord networks
under churn-intensive workload.


V. DISCUSSION 


Our analysis and simulations have shown that by
using parallel lookups and by amortizing the network
maintenance costs into the lookup costs, our approach offers
significantly better lookup path lengths and latencies with
little additional cost in terms of bandwidth consumption.
Our simulations have also shown that even though
multiple messages are sent per lookup step, the lookup
traffic generated is not significantly larger than that for a
sequential lookup algorithm because the lookup path lengths
are significantly shorter. In fact, the lookup traffic
generated by a 3-way EpiChord network is comparable to that
for a corresponding Chord network. This is a desirable
trade-off because lookup latency is the principal measure
of lookup performance.

Fig. 25. Comparison of lookup latency between Chord and 3-way
EpiChord under varying amounts of traffic.

Fig. 26. Comparison of lookup path length between Chord and 3-way
EpiChord under varying amounts of traffic.

Our new algorithm yields substantial savings in terms
of setup time and the number of messages sent when a
node first joins the network, compared to Chord and many
other DHTs. To join the network, a node need only
perform one lookup, contact its successor and predecessor,
and perform an initial cache transfer.¹¹ Although
performance is better with a full initial cache transfer, a
minimal transfer of O(log n) entries is sufficient to
guarantee worst-case O(log n)-hop lookup performance. In
contrast, O(log n) lookups (O(log² n) messages) are required
for a Chord node to fully initialize its finger table.

¹¹ Adjacent nodes in an EpiChord network usually have a similar
set of address space slices for their cache invariants. This means that
after a node completes a cache transfer from either its successor or
predecessor, it will generally have a cache that already satisfies the
invariant.

Fig. 27. Comparison of lookup message count between Chord and
3-way EpiChord under varying amounts of traffic.

Fig. 28. Effect of l on lookup latency for a 3-way EpiChord network.

Although our reply messages will tend to be larger than
those of traditional sequential lookup algorithms, since l
“best” entries are returned, even with the increase in size,
the reply messages are only about 100 bytes in size
(including the 28-byte UDP/IP header) at a reasonable
setting of l = 3. Hence, the increased size of the responses
is not an issue even for nodes behind a 56k modem line
since the packets are relatively small.


VI. RELATED WORK 


Our parallelized lookup algorithm and reactive cache
management strategy can be applied to any of the existing
DHT routing topologies that have some flexibility in the
choice of neighbors (i.e., ring, tree or xor) [14]. We chose
to implement our proof-of-concept DHT using the Chord
ring [2] as the underlying routing topology because of its
simplicity.

Fig. 29. Effect of l on lookup path length for a 3-way EpiChord
network.

Fig. 30. Effect of l on lookup message count for a 3-way EpiChord
network.

Like EpiChord, Kademlia [6] gathers routing infor- 
mation from observing lookup traffic and uses parallel 
lookups to improve lookup resilience. The organization of 
its routing entries is also somewhat analogous to that for 
EpiChord, albeit in a different address space. One key dif- 
ference between Kademlia and EpiChord is that Kademlia 
limits the amount of routing state to O(log n) while Epi- 
Chord does not. By limiting its routing state to O(log n), 
Kademlia lookups take on average O(log n) hops while
EpiChord can often achieve one- or two-hop lookup per- 
formance with its large routing state. While Kademlia 
employs parallel lookups mainly to improve lookup per- 
formance, EpiChord actually requires parallel lookups to 
cope with possible timeouts arising from maintaining a 
large amount of routing state. 

The MIT Chord [20] implementation includes a loca- 
tion cache, i.e., nodes remember the IP address and ids 


of nodes that recently contacted them and use this infor- 
mation in their lookup. Zhuang and Zhou showed that the 
Chord location cache is able to reduce lookup path length 
by 1/2 of the logarithm of the cache size, but it does not 
scale to more than 2,000 nodes in a typical network set- 
ting because of stale cache entries, which cause timeouts 
and redundant hops [24]. 

In addition to proximity neighbor selection [14], Dabek 
et al. recently investigated the effectiveness of a com- 
bination of techniques in improving lookup latency for 
DHash++ [18] (an O(log n)-state DHT based on Chord), 
including synthetic coordinates [15], erasure coding [17], 
integration of key lookups and data fetches and an inte- 
grated transport protocol (STP). EpiChord is certainly not 
as sophisticated, but we are not seeking to be. Most of the 
techniques in DHash++ are orthogonal to our lookup al- 
gorithm and can be integrated into EpiChord if so desired. 

Gupta et al. proposed one- and two-hop schemes that 
disseminate global network membership changes using a 
background broadcast process that scales up to a million 
nodes [13]. Other two-hop schemes that have been pro- 
posed include Kelips [9] and Structured Superpeers [12]. 
The major drawbacks of these schemes are that they either 
impose a fixed (and relatively high) amount of constant 
background traffic on all nodes (even ones that are rela- 
tively inactive), and/or impose significant asymmetry in 
the bandwidth consumption across nodes in the network. 
In return, they are in general able to achieve somewhat 
better one- and two-hop lookup performance than Epi- 
Chord, which also often achieves O(1)-hop lookups, but 
only in an incidental and laissez faire manner and at a 
somewhat lower cost. 

To the best of our knowledge, only Chord [20] has a 
strong stabilization algorithm that will provably fix loopy 
network configurations and their stabilization algorithm 
is specific to their lookup algorithm and cannot be ap- 
plied generally to other DHT routing algorithms. Our 
token-passing stabilization mechanism can be applied to 
any DHT that has a circular address space. 


VII. FUTURE WORK 


Instead of limiting the number of concurrent queries
that we allow a lookup to have in parallel at any instant in
time to p, it might be desirable to let the number of
concurrent queries be p_max (> p) if the number of nodes in the
network is large and the node caches are relatively sparse,
since under such circumstances, the initial p nodes are
separated from the node corresponding to id by many
intermediate nodes. Having more concurrent queries p_max
improves lookup latency and allows the querying node to
learn about more nodes, thereby improving the quality of
its node cache. Of course, there is a trade-off of increased
lookup traffic.

Conceptually, γ can be used to adaptively adjust the
cache entry expiration period. We can choose a target γ_t
and the cache entry expiration period is incrementally
decreased when γ > γ_t, until γ < γ_t. We have not
implemented such a scheme, but it is straightforward to do
so.
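
One possible shape for such a scheme is sketched below. It is only an illustration of the idea in the preceding paragraph, not an implemented or evaluated mechanism: the step size, the minimum lifetime, and the function name are all assumed tuning knobs.

```python
def adjust_ttl(ttl_s, gamma, gamma_target, step_s=10.0, min_ttl_s=10.0):
    """Shorten the cache-entry lifetime while the stale fraction is too high.

    gamma is the observed stale fraction; gamma_target, step_s and
    min_ttl_s are assumed parameters, not values from this report.
    """
    if gamma > gamma_target and ttl_s > min_ttl_s:
        return max(min_ttl_s, ttl_s - step_s)
    return ttl_s

# Example: starting from a 120 s lifetime with 20% stale entries and a
# 10% target, one adjustment step shortens the lifetime to 110 s.
print(adjust_ttl(120.0, gamma=0.20, gamma_target=0.10))
```

A node would call this periodically with its measured stale fraction; the floor on the lifetime keeps the cache from being flushed so aggressively that it falls below the minimal routing set.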

EpiChord is currently not fully optimized. There is still 
significant flexibility for nodes to adopt individual poli- 
cies to further enhance and optimize their individual (and 
thereby global) lookup performance, if so desired. For ex- 
ample, a node that discovers a high rate of node failures 
within the network (i.e., from the fact that many queries 
are unacknowledged) can adaptively increase the number 
of parallel queries per lookup as well as be more aggres- 
sive in flushing old entries from its cache. One can also 
imagine improving the dissemination of routing state by 
piggybacking additional random node entries on requests 
or responses. If the system lookup rate is low or a higher 
level of background traffic can be tolerated, EpiChord 
can generate additional queries, or employ a hierarchical 
broadcast scheme [13] or a provably efficient epidemic 
cache exchange mechanism [25], to proactively increase 
the number of cached entries per node. Finally, it might 
perhaps be possible to formulate the performance opti- 
mization problem as a learning problem and apply some 
existing AI technique to optimize overall system perfor- 
mance by tuning system parameters at runtime depending 
on the operating conditions and constraints (i.e., amount 
of lookup traffic and available background bandwidth). 


VIII. CONCLUSION 


Our goal in this work is not to design the perfect DHT. 
Instead, our objectives are: (i) to explore the effectiveness 
of our new technique, where we combine parallel queries 
with a reactive cache management strategy, in allowing 
us to move from an O(log n)-state-per-node DHT topol- 
ogy to an unlimited-state-per-node architecture; and (ii) to 
understand the trade-offs within the unlimited-state-per- 
node DHT design space. 

Proximity routing has been shown to be effective in re- 
ducing DHT routing latency [14]. Although we do not 
track latency information or actively decide on which 
nodes to query based on proximity, our parallel asyn- 
chronous lookup approach in fact exploits proximity indi- 
rectly. The key observation here is that the final sequence 
of lookups that returns the correct answer first in our 
asynchronous parallel lookup algorithm is approximately 
equivalent to a proximity-optimized lookup sequence for 
the corresponding sequential lookup algorithm. 


Our parallel lookup algorithm is simple and effective, 
and our reactive approach to routing state maintenance 
allows our DHT to adapt naturally to a range of lookup 
workloads. We have quantified the performance-cost 
trade-offs for our lookup algorithm and showed that we 
can reduce both lookup latencies and path lengths by a 
factor of 3 by issuing only 3 queries asynchronously in 
parallel per lookup and that the number of messages thus 
generated is in general no more than that for the corre- 
sponding sequential Chord lookup algorithm, and at most 
up to 50% more under high churn rates. 
APPENDIX

A. Pseudocode for Basic Lookup Algorithm

Let tried_set = set of nodes that have already been probed (initially empty)
    pending_set = set of queries that are currently pending (initially empty)
    answer = final answer for this query (initially null)
    best_predecessor = best known predecessor of id probed
    best_successor = best successor heard from for id probed

// To find the successor node for the identifier id.
findSuccessor(id)
    // Gets from cache the best known successor of id,
    // excluding entries already found in tried_set.
    try_entry ← cache.getNext(id, tried_set);
    sendQuery(id, try_entry);
    for i ← 0 up to p - 1
        // Gets from cache the best known predecessor of id,
        // excluding entries already found in tried_set.
        try_entry ← cache.getPrev(id, tried_set);
        if try_entry ≠ null
            sendQuery(id, try_entry);

// To send a query to node n to look up identifier id.
sendQuery(id, n)
    // Sends a UDP packet to node n to look up identifier id,
    // with information on the nodes currently being probed.
    sendLookupMessage(id, n, pending_set)
    // Sets a timeout for node n.
    setTimeout(n)
    tried_set.add(n)
    pending_set.add(n)

// This function is called when a reply is received from node n.
receiveReply(n, success, reply_set)
    // Add all the entries received from n to the cache.
    cache.addEntries(reply_set)
    pending_set.remove(n)
    if n.id ∈ (owner.id, best_successor)
        best_successor ← n
    if success = true
        answer ← reply_set.getAnswer();
        // return answer to the query.
        lookup_success(answer);
    else
        sendMoreQueries();

// This function is called if a timeout occurs for the query to node n.
timeout(n)
    pending_set.remove(n)
    sendMoreQueries();

// This function is called to send out more concurrent queries, if necessary.
sendMoreQueries()
    // Gets from cache the node try_entry which is closest to id
    // such that try_entry.id ∈ (best_predecessor, best_successor),
    // excluding entries already found in tried_set.
    try_entry ← cache.getBestEntry(id, tried_set);
    while (|pending_set| < p_max) ∧ (try_entry ≠ null)
        if try_entry.id ∈ (owner.id, id)
            best_predecessor ← try_entry
        sendQuery(id, try_entry)
        try_entry ← cache.getBestEntry(id, tried_set);
    if |pending_set| = 0
        // return lookup failure.
        lookup_failure();

B. Analysis of Expected Worst-Case Lookup Performance

To analyze the expected worst-case lookup performance,
we consider the following scenario. Suppose we
are at a node with id x and we are trying to resolve an id
y, s.t. $y \in (x + 2^i, x + 2^{i+1})$. The range $(x + 2^i, x + 2^{i+1})$
is the size of one bucket in x's cache. This means that
we have at least j entries in the bucket and hence we can
certainly find node z s.t. $z \in (x + 2^i, x + 2^{i+1})$.

Fig. 31. Analysis of expected worst-case lookup performance.

Since $y \in (x + 2^i, x + 2^{i+1})$, we know that $|x - y| < 2^{i+1}$,
but because $z \in (x + 2^i, x + 2^{i+1})$, we know that
$|z - y| < 2^i$. Hence, in each lookup step, even if the actual
distance to the destination id is not reduced, the maximum
possible distance is steadily reduced by at least a factor of
two even in the worst case. This implies that lookups can
be made in O(log n) hops in the worst case.

The bound derived from the above analysis is very loose
because it is based only on the assumption that there is at
least one other entry in the same bucket as the destination
node. Since we have at least j entries in the bucket, we
can clearly do significantly better. Under the assumption
that x, y and all j entries in the $(x + 2^i, x + 2^{i+1})$ bucket
are independent and uniformly distributed, we can show
with some elementary probability that:

$$\frac{E(|x - y|)}{E(|z - y|)} = 3j + \frac{6}{j+3}$$

Proof: Consider first a unit interval containing j independent,
uniformly distributed points, and let min be the distance
from a point at position x to the nearest of them. For y < 0.5,

$$\Pr(\min > y \mid x) = \begin{cases} (1 - x - y)^j & \text{if } x < y \\ (1 - 2y)^j & \text{if } y < x < 1 - y \\ (x - y)^j & \text{if } 1 - y < x \end{cases}$$

If x is uniformly distributed,

$$\Pr(\min > y) = \int_0^1 \Pr(\min > y \mid x)\,p(x)\,dx = \int_0^1 \Pr(\min > y \mid x)\,dx, \quad \text{since } p(x) = 1$$

If y < 0.5,

$$\begin{aligned}
\Pr(\min > y) &= \int_0^{y} (1 - x - y)^j\,dx + \int_y^{1-y} (1 - 2y)^j\,dx + \int_{1-y}^{1} (x - y)^j\,dx \\
&= \left[-\frac{(1 - x - y)^{j+1}}{j+1}\right]_0^{y} + \left((1 - y) - y\right)(1 - 2y)^j + \left[\frac{(x - y)^{j+1}}{j+1}\right]_{1-y}^{1} \\
&= \frac{2}{j+1}\left[(1 - y)^{j+1} - (1 - 2y)^{j+1}\right] + (1 - 2y)^{j+1} \\
&= \frac{2}{j+1}(1 - y)^{j+1} + \frac{j-1}{j+1}(1 - 2y)^{j+1}
\end{aligned}$$

If y > 0.5,

$$\begin{aligned}
\Pr(\min > y) &= \int_0^{1-y} (1 - x - y)^j\,dx + \int_y^{1} (x - y)^j\,dx \\
&= \left[-\frac{(1 - x - y)^{j+1}}{j+1}\right]_0^{1-y} + \left[\frac{(x - y)^{j+1}}{j+1}\right]_y^{1} \\
&= \frac{2}{j+1}(1 - y)^{j+1}
\end{aligned}$$

Since $\Pr(\min < y) = 1 - \Pr(\min > y)$,

$$p(\min) = \begin{cases} 2(1 - y)^j + 2(j - 1)(1 - 2y)^j & \text{if } y < 0.5 \\ 2(1 - y)^j & \text{if } y > 0.5 \end{cases}$$

$$\begin{aligned}
E(\min) &= \int_0^1 y\,p(y)\,dy \\
&= \int_0^1 2y(1 - y)^j\,dy + \int_0^{0.5} 2(j - 1)\,y\,(1 - 2y)^j\,dy \\
&= \left[-\frac{2y(1 - y)^{j+1}}{j+1}\right]_0^1 + \frac{2}{j+1}\int_0^1 (1 - y)^{j+1}\,dy \\
&\quad + \left[-\frac{(j - 1)\,y\,(1 - 2y)^{j+1}}{j+1}\right]_0^{0.5} + \frac{j-1}{j+1}\int_0^{0.5} (1 - 2y)^{j+1}\,dy \\
&= \frac{2}{(j+1)(j+2)} + \frac{j-1}{2(j+1)(j+2)} \\
&= \frac{j+3}{2(j+1)(j+2)}
\end{aligned}$$

If we now consider the original scenario, where
$y, z \in (x + 2^i, x + 2^{i+1})$, by the fact that the node ids are
uniformly distributed,

$$E(|x - y|) = \frac{2^i + 2^{i+1}}{2} = 2^{i-1} + 2^i$$

$$E(|z - y|) = 2^i \cdot \frac{j+3}{2(j+1)(j+2)}$$

$$\frac{E(|x - y|)}{E(|z - y|)} = \frac{3(j+1)(j+2)}{j+3} = 3j + \frac{6}{j+3}$$
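
The closed form for E(min) in Appendix B, (j+3) / (2(j+1)(j+2)), can be sanity-checked numerically. The sketch below compares it with a Monte Carlo estimate of the distance from a uniform target to the nearest of j uniform points on the unit interval; the function names, trial count, and seed are illustrative choices.

```python
import random

def expected_min_distance(j):
    """Closed form (j+3) / (2(j+1)(j+2)) for the expected distance from a
    uniform target to the nearest of j uniform points on the unit interval."""
    return (j + 3) / (2 * (j + 1) * (j + 2))

def simulate_min_distance(j, trials=200_000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        target = rng.random()
        total += min(abs(rng.random() - target) for _ in range(j))
    return total / trials

for j in (1, 2, 3):
    print(j, round(expected_min_distance(j), 4), round(simulate_min_distance(j), 4))
```

For j = 1 the closed form reduces to 1/3, the familiar expected distance between two independent uniform points, and dividing 3/2 by the j = 2 value recovers the reduction factor of 7.2 used in the simulation discussion.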


